ABSTRACT

Summary: We introduce cmine, a novel implementation in ANSI C 
of the MINE family of algorithms for computing maximal information- 
based measures of dependence between two variables in large 
datasets. We also provide two interfaces, minerva and minepy, for 
the C engine through the R environment and the Python scripting 
language, respectively. The cmine solution reduces the large memory 
requirement of the first Java implementation for both the R and Python 
interfaces. Results on microarray and RNA-seq transcriptomics 
datasets are described. 

Availability and Implementation: Source code implemented in 
ANSI C (cmine), with wrappers in R (minerva) and Python (minepy), 
is freely available for download under the GPL3 licence. The R 
package minerva is also available through the CRAN repository. 
The Python library minepy is also on SourceForge. All software is 
multiplatform (MS Windows, Unix/Linux and Mac OS X). 
Contact: furlan@fbk.eu 

Supplementary information: Supplementary information is 
available at the cmine website (http://mpba.fbk.eu/cmine). 



1 INTRODUCTION 

The Maximal Information-based Nonparametric Exploration (MINE) 
family of statistics, including the Maximal Information Coefficient 
(MIC) measure, was recently introduced by Reshef et al. (2011), 
aimed at fast exploration of two-variable relationships in many- 
dimensional data sets. MINE consists of the algorithms for 
computing four measures of dependence between two variables: MIC, 
Maximum Asymmetry Score (MAS), Maximum Edge Value (MEV) and 
Minimum Cell Number (MCN). These measures have the generality 
and equitability properties. Generality is the ability 
to capture variable relationships of different nature, while 
equitability is the property of penalizing similar levels of noise 
in the same way, regardless of the nature of the relation between 
the variables. The MINE suite was immediately acclaimed as 
a real breakthrough in the data mining of complex biological 
data (Speed, 2011), and it also drew criticism (Simon and Tibshirani, 
2012; Gorfine et al., 2012). Many groups worldwide have 
already proposed its use for explorative data analysis in 
computational biology, from network interaction dynamics to 
virus ranking (Weiss et al., 2012; Das et al., 2012; Anderson et al., 
2012; Karpinets et al., 2012; Faust and Raes, 2012). Together with 
the algorithm description, the MINE authors provided a Java 
implementation (MINE.jar), two wrappers (R and Python), and 
four reference datasets (Reshef et al., 2011). However, applicability 
of MINE.jar to all pairs of features in large datasets is currently 
limited by memory requirements and computing time (Miller, 
2012). There is also clear interest in a native parallelization of 
MINE tasks, which is currently unavailable. These issues represent 
an obstacle to a systematic application of MINE algorithms to high- 
throughput omics data, for example as a substitute for Pearson 
correlation in network studies. Inspired by these considerations, we 
propose cmine, a C implementation of the MINE algorithms, and 
two interfaces to cmine from R (minerva) and Python (minepy). 



2 THE MINE C ENGINE AND ITS WRAPPERS 

The cmine engine is written in ANSI C by reimplementing ex 
novo the algorithms originally described in Reshef et al. (2011) 
and its supplementary material (the source code of MINE.jar is 
not distributed). The C code provides three structures describing, 
respectively, the data, the parameter configuration and all the 
corresponding maximum normalized mutual information scores. 
The core function mine_compute_score takes a dataset structure 
and a configuration structure as input and returns the score structure 
as output. Given a score structure, four functions compute the 
MINE statistics. The minepy Python module works with Python ≥ 
2.6 and 3.X, with Numpy ≥ 1.3.0 as the sole requirement: the interface 
consists of the class minepy.MINE, whose methods correspond to 
the cmine functions. The R package minerva is built as an R 
wrapper (R ≥ 2.14) to cmine following the standard procedure 
detailed in (R Core Team, 2012). The main function mine takes 
the dataset and the parameter configuration as inputs and returns 
the four MINE statistics. minerva allows native parallelization: 
based on the R package parallel, the number of cores can be 
passed as a parameter to the mine function whenever multi-core 
hardware is available. The curated version of the CDC15 Spellman 
yeast dataset (Spellman et al., 1998) used in Reshef et al. (2011) 
is included as an example. Documentation (on-line or as a PDF) for 
cmine and minepy is available at the cmine website, while minerva 
>>> # imports the numpy module 
>>> import numpy 
>>> # imports the minepy module 
>>> import minepy 

>>> # x = [0, 0.001, 0.002, ..., 0.998, 0.999, 1] 
>>> x = numpy.linspace(0, 1, 1001) 
>>> # y = sin(10 * pi * x) + x 
>>> y = numpy.sin(10 * numpy.pi * x) + x 

>>> # build the MINE object 
>>> mine = minepy.MINE(alpha=0.6, c=15) 
>>> # computes the information scores 
>>> mine.score(x, y) 

>>> # returns the Maximal Information Coefficient (MIC) 
>>> mine.mic() 
0.9999992800928936 
>>> # returns the Maximum Asymmetry Score (MAS) 
>>> mine.mas() 
0.7281444902035837 

(a) 

> # load the minerva package 
> library(minerva) 

> # x = c(0, 0.001, 0.002, ..., 0.998, 0.999, 1) 
> x <- seq(0, 1, 0.001) 
> # y = sin(10 * pi * x) + x 
> y <- sin(10 * pi * x) + x 

> # computes the information scores 
> res <- mine(x, y, alpha=0.6, C=15) 

> # returns the Maximal Information Coefficient (MIC) 
> res$MIC 
0.9999993 
> # returns the Maximum Asymmetry Score (MAS) 
> res$MAS 
0.7281445 

(b) 
Fig. 2. Comparison of MIC values for variable #1 (time) vs. all the other 
4381 variables of the CDC15 Spellman yeast dataset by using the original 
MINE.jar (y axis) and minepy, with α = 0.67 for both implementations. The 
small differences between the two implementations may be due to a varying 
number of clumps in MINE.jar and to the different data type for the 
floating point values. 



Fig. 1. Code usage example: computing MIC and MAS on two vectors with 
minepy (a) and minerva (b). 



documentation is accessible in R as on-line help or from the CRAN 
repository. In Fig. 1 we show a usage example for both wrappers. 
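
To make the notion of a "normalized mutual information score" more concrete, the following is a minimal sketch in pure Python that computes the mutual information of two variables on a single fixed grid and divides it by the logarithm of the smaller grid dimension. It is not the cmine algorithm: MINE searches over many grids whose resolution is bounded through the alpha parameter, whereas this sketch assumes one equal-width grid. The function names equal_width_bins and normalized_mi are hypothetical and do not belong to cmine, minepy or minerva.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable falls entirely into one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def normalized_mi(x, y, nx, ny):
    """Mutual information on an nx-by-ny grid, divided by log(min(nx, ny)).

    Assumption: a single equal-width grid, not the grid search used by MINE.
    """
    if min(nx, ny) < 2:
        raise ValueError("the grid needs at least 2 bins per axis")
    n = len(x)
    bx, by = equal_width_bins(x, nx), equal_width_bins(y, ny)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        # p_ij * log(p_ij / (p_i * p_j)), written with raw counts
        mi += (count / n) * math.log(count * n / (px[i] * py[j]))
    return mi / math.log(min(nx, ny))


# Illustrative data: a noiseless linear relation on 1001 points.
x = [i / 1000 for i in range(1001)]
y_linear = [2 * v for v in x]
score = normalized_mi(x, y_linear, 2, 2)  # close to 1
```

Dividing by log(min(nx, ny)) bounds the score to [0, 1], so that values from grids of different resolution can be compared.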

Performance comparison. The cmine suite was tested for consistency 
with the original MINE.jar v1.0.1 implementation on the 
Spellman and microbiome reference datasets, available at 
http://www.exploredata.net. For the CDC15 Spellman 
yeast dataset, 4381 transcripts measured at 23 timepoints, we report 
in Fig. 2 the comparison of MIC values computed for all feature 
pairs by using the original MINE.jar and minepy, with α = 0.67 for 
both implementations. Most values are identical; the few discrepancies 
(i.e., points deviating from the diagonal) may be due to a different 
implementation of the clump computation and to the different data 
type for the floating point values (Java float versus C double). 

We performed the same all-feature-pairs computations on 23 
time points and increasing feature set sizes with MINE.jar and the 
two cmine wrappers: the RAM and CPU usage are displayed in 
Fig. 3. While MINE.jar cannot perform the computation for dataset 
instances with more than 1000 features, minerva and minepy complete 
all the tasks with a considerable saving in RAM allocation (Fig. 3(a)). 
Computational times are comparable among all the methods, even in 
the parallel implementation of minerva (Fig. 3(b)). 

On the microbiome dataset, we computed the MINE functions for 
all the 6696 × 6696 pairs. Comparing with Table S13 of the Reshef et al. 
(2011) Supplementary Material (77 top ranked association pairs), we 
obtained 44 identical results and 73 values whose difference is less 
than 0.01. The median of all differences is 0, the 3rd quartile is 
0.003, and the largest observed difference is 0.014 (complete table 
available on the cmine website). 
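
The summary statistics used in this comparison (identical values, values within a tolerance, median, third quartile and maximum of the absolute differences) can be reproduced for any two score vectors with a short script. The sketch below uses only the Python standard library; the function name summarize_differences and the example numbers are hypothetical and are not taken from the datasets discussed here. It assumes that "identical" means exact equality of the stored values and that the quartile follows the default method of statistics.quantiles, which may differ from the one used for the figures above.

```python
import statistics


def summarize_differences(reference, candidate, tolerance=0.01):
    """Summarize absolute differences between two equally long score lists."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # quantiles(n=4) returns the three quartile cut points (Python >= 3.8)
    _, _, third_quartile = statistics.quantiles(diffs, n=4)
    return {
        "identical": sum(d == 0 for d in diffs),
        "within_tolerance": sum(d < tolerance for d in diffs),
        "median": statistics.median(diffs),
        "third_quartile": third_quartile,
        "max": max(diffs),
    }


# Hypothetical scores from two implementations, for illustration only.
reference_scores = [0.50, 0.60, 0.70, 0.80, 0.90]
candidate_scores = [0.50, 0.60, 0.70, 0.805, 0.92]
summary = summarize_differences(reference_scores, candidate_scores)
```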

We additionally tested the cmine suite on two recent high- 
throughput transcriptomics datasets, from Affymetrix HumanExon 
1.0 ST arrays of human brain tissues and Illumina Genome Analyzer 
II sequencing of human non-small cell lung cancer, respectively. 
Numerical details on the datasets and on the performance of computing 
the MINE statistics, first variable vs all the others, are reported in Table 1. 
Finally, we tested the scaling properties of minerva with varying 
values of the α parameter on two uniformly distributed random 

Table 1. Performance of cmine (one versus all) on microarray and 
sequencing datasets identified by GEO accession number and original 
reference. n: number of samples, p: number of features. CPU: elapsed time 
used by the process (in seconds). RAM: resident set size, i.e., the non- 
swapped physical memory that a task has used (in kilobytes), for minerva 
(R) and minepy (P). 
[Figure 3, panels (a) and (b). x-axis in both panels: number of features (dataset size in kB).] 

Fig. 3. (a) Resident set size, i.e., the non-swapped physical memory that a 
task has used (in megabytes) and (b) elapsed time used by the process (in 
seconds) versus increasing number of features (log scale) to simultaneously 
compute the MINE statistics for all pairs of features of the CDC15 Spellman 
yeast dataset, comparing MINE.jar v1.0.1, minerva and minepy. MINE.jar 
can complete the task only for the first 3 datasets (with 200, 500, and 
1,000 features). The number in parentheses in the x-axis labels is the dataset 
dimension in kilobytes (kB). 



[Figure 4. x-axis: number of samples; y-axis ticks from 10 to 50; legend: α = 0.5, α = 0.6, α = 0.7.] 

Fig. 4. Average of the elapsed time on 100 repetitions computing the MINE 
statistics with the minerva package on 2 uniformly distributed random 
variables for an increasing number of samples and α = 0.5, 0.6, and 0.7. 

vectors with increasing length. In Fig. 4 we show the average of 
100 replicates for 5 different vector lengths. Because the cost of 
computing MINE statistics grows linearly with the number of pairs of 
variables, Fig. 4 can be used to derive a rough estimate of the total 
time required to perform a MINE computation on a given dataset. 
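
The rough estimate mentioned above amounts to multiplying the time for one pair (read from a timing curve at the relevant number of samples) by the number of pairs to be evaluated. A minimal sketch follows; the function names and the per-pair time of 0.001 seconds are hypothetical placeholders, not measured values, and the estimate assumes a single core with no overhead.

```python
def n_pairs_all_vs_all(p):
    """Number of distinct unordered pairs among p features."""
    return p * (p - 1) // 2


def n_pairs_one_vs_all(p):
    """Number of pairs when one feature is compared with all the others."""
    return p - 1


def estimate_total_seconds(n_pairs, seconds_per_pair):
    """Linear estimate: total time = number of pairs * time for one pair."""
    return n_pairs * seconds_per_pair


# Example with 4381 features and a hypothetical 0.001 s per pair.
pairs = n_pairs_all_vs_all(4381)
total = estimate_total_seconds(pairs, 0.001)
```

The number of all-versus-all pairs grows quadratically with the number of features, so this count, rather than the per-pair time, dominates the estimate for large datasets.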